1.0  INTRODUCTION


The  goal  of  the  University  of  Washington  effort  under  DIESEL  is  to  develop  a  unified  approach 
to  entity,  schema  and  concept  matching.  Entity  resolution  is  the  problem  of  determining  which 
mentions  in  the  data  correspond  to  the  same  object  (e.g.,  “J.  Smith”  and  “Jane  Smith”  may  be  the 
same person). Schema matching is the problem of determining which fields in a database or other
structure  correspond  to  the  same  attributes  (e.g.,  “Contact”  and  “Telephone”  may  be  the  same 
attribute).  Concept  matching  (a.k.a.  ontology  alignment)  is  the  problem  of  determining  which 
concepts  in  two  taxonomies  correspond  to  each  other  (e.g.,  “Faculty”  in  one  taxonomy  may 
mean the same as “Staff” in another). To date, each of these problems has been addressed
separately,  assuming  that  the  other  two  have  been  solved  a  priori  (e.g.,  schema  matching  may  be 
performed  assuming  that  objects  and  concepts  have  already  been  resolved).  In  most  cases, 
however,  all  three  problems  are  present  simultaneously,  and  a  truly  robust  and  widely  applicable 
information  integration  system  therefore  needs  to  solve  the  three  simultaneously. 

We  successfully  developed  the  approach  we  planned,  as  described  in  a  series  of  papers  [1,2, 
3],  building  on  our  earlier  work  on  entity  resolution  [4,  5,  6].  Our  approach  uses  Markov  logic 
and  a  combination  of  existing  and  new  learning  and  inference  algorithms  for  it  [7].  The  key  idea 
is  to  leverage  joint  inference,  gradually  propagating  information  from  easier  to  harder  matches. 
For  example,  if  two  fields  are  the  same,  then  perhaps  the  corresponding  objects  are  the  same,  and 
maybe  the  concepts  they  instantiate  are  also  the  same.  We  developed  both  supervised  and 
unsupervised  approaches  (i.e.,  with  and  without  labeled  data),  and  observation-level  and  object- 
level  approaches  (i.e.,  inferring  equality  of  observations  vs.  inferring  their  membership  in 
objects,  relations,  etc.).  Generally  speaking,  unsupervised  object-level  matching  is  the  superior 
approach,  and  the  one  we  would  recommend  a  priori.  We  also  studied  the  incorporation  of 
background  knowledge  into  the  matching  process,  and  we  found  that  it  is  extremely  helpful,  in 
the  sense  that  a  small  amount  of  easily-stated  knowledge  can  go  a  long  way  toward  ensuring 
accurate  matching.  We  also  studied  how  background  knowledge  can  be  efficiently  induced  from 
data  if  it  is  not  known  a  priori,  and  found  that  the  learned  knowledge  correctly  captures  the 
regularities  in  the  data,  and  helps  in  ensuring  good  matches. 

We  begin  by  briefly  reviewing  some  background  on  Markov  logic.  Then  we  describe  the 
three  systems  we  developed  in  detail,  and  present  their  experimental  results.  Finally  we  conclude 
with  some  recommendations. 
2.0  BACKGROUND  ON  MARKOV  LOGIC


Markov  logic  networks  (MLNs)  combine  logic  and  probability  by  attaching  weights  to  first-order 
logic  rules  [8],  and  viewing  these  as  templates  for  features  of  Markov  networks  [9]. 

In first-order logic, formulas are constructed using four types of symbols: constants,
variables, functions, and predicates. Constants represent objects in the domain of discourse (e.g.,
people: Anna, Bob, etc.). Variables (e.g., x, y) range over the objects in the domain. Predicates
represent relations among objects (e.g., Friends), or attributes of objects (e.g., Student).
Variables and constants may be typed. An atom is a predicate symbol applied to a list of
arguments, which may be variables or constants (e.g., Friends(Anna, x)). (In this report, we use
predicate and relation interchangeably.) A ground atom is an atom all of whose arguments are
constants (e.g., Friends(Anna, Bob)). A world is an assignment of truth values to all possible
ground atoms. A database is a partial specification of a world; each atom in it is true, false or
(implicitly) unknown. A clause is a disjunction of atoms, each of which may be negated.

A Markov network or Markov random field is a model for the joint distribution of a set of
variables $X = (X_1, \ldots, X_n) \in \mathcal{X}$. It is composed of an undirected graph $G$ and
a set of potential functions $\phi_k$. The graph has a node for each variable, and the model has a
potential function for each clique in the graph. A potential function is a non-negative
real-valued function of the state of the corresponding clique. The joint distribution represented
by a Markov network is given by $P(X = x) = \frac{1}{Z} \prod_k \phi_k(x_{\{k\}})$, where
$x_{\{k\}}$ is the state of the $k$th clique (i.e., the state of the variables that appear in that
clique). $Z$, known as the partition function, is given by
$Z = \sum_{x \in \mathcal{X}} \prod_k \phi_k(x_{\{k\}})$. Markov networks are often conveniently
represented as log-linear models, with each clique potential replaced by an exponentiated weighted
sum of features of the state, leading to
$P(X = x) = \frac{1}{Z} \exp\left(\sum_j w_j f_j(x)\right)$. A feature may be any real-valued
function of the state. This report will focus on binary features $f_j(x) \in \{0, 1\}$. In the
most direct translation from the potential-function form, there is one feature corresponding to
each possible state $x_{\{k\}}$ of each clique, with its weight being $\log \phi_k(x_{\{k\}})$.
This representation is exponential in the size of the cliques. However, we are free to specify a
much smaller number of features (e.g., logical functions of the state of the clique), allowing for
a more compact representation than the potential-function form, particularly when large cliques
are present. Markov logic takes advantage of this.
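
The log-linear form above can be illustrated with a minimal sketch. The two binary variables, the two features and their weights are hypothetical choices made only for illustration; the partition function is computed by brute-force enumeration, which is feasible only for toy models.

```python
import itertools
import math

# Hypothetical toy model: two binary variables, two binary features.
weights = [1.5, -0.5]
features = [
    lambda x: 1 if x[0] == x[1] else 0,  # f_1: the two variables agree
    lambda x: 1 if x[0] == 1 else 0,     # f_2: the first variable is on
]

def unnormalized(x):
    """exp(sum_j w_j * f_j(x)) for a single state x."""
    return math.exp(sum(w * f(x) for w, f in zip(weights, features)))

# Enumerate every state to obtain the partition function Z.
states = list(itertools.product([0, 1], repeat=2))
Z = sum(unnormalized(x) for x in states)

def prob(x):
    """P(X = x) under the log-linear model."""
    return unnormalized(x) / Z

for x in states:
    print(x, round(prob(x), 4))
```

States in which the positively weighted feature fires receive more probability mass, and states in which the negatively weighted feature fires receive less.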

A  Markov  logic  network  (MLN)  is  a  set  of  weighted  first-order  formulas.  Together  with  a  set 
of  constants  representing  objects  in  the  domain,  it  defines  a  Markov  network  with  one  node  per 
ground  atom  and  one  feature  per  ground  formula.  The  weight  of  a  feature  is  the  weight  of  the 
first-order  formula  that  originated  it.  The  probability  distribution  over  possible  worlds  x 
specified by the ground Markov network is given by
$P(X = x) = \frac{1}{Z} \exp\left(\sum_{i \in F} \sum_{j \in G_i} w_i g_j(x)\right)$,
where $Z$ is the partition function, $F$ is the set of all first-order formulas in the MLN, $G_i$
is the set of groundings of the $i$th first-order formula, and $g_j(x) = 1$ if the $j$th ground
formula is true and $g_j(x) = 0$ otherwise. Markov logic enables us to compactly represent complex
models in non-i.i.d. domains. General algorithms for inference and learning in Markov logic are
discussed in [7].
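
As a rough illustration of how an MLN defines a distribution over worlds, the sketch below grounds a single weighted formula over two constants and scores every world by the number of true groundings. The formula Friends(x, y) => (Student(x) <=> Student(y)) and its weight are hypothetical, chosen only because they reuse the example predicates above; enumeration of all worlds is practical only at this toy scale.

```python
import itertools
import math

constants = ["Anna", "Bob"]

# One node per ground atom: Student(c) and Friends(a, b).
atoms = [("Student", (c,)) for c in constants] + [
    ("Friends", (a, b)) for a in constants for b in constants
]

def n_true_groundings(world):
    """Count groundings of Friends(x, y) => (Student(x) <=> Student(y))
    that are true in the given world (a dict from ground atom to bool)."""
    count = 0
    for x, y in itertools.product(constants, repeat=2):
        friends = world[("Friends", (x, y))]
        same = world[("Student", (x,))] == world[("Student", (y,))]
        if (not friends) or same:
            count += 1
    return count

w = 1.0  # hypothetical formula weight

# A world assigns a truth value to every ground atom.
worlds = [
    dict(zip(atoms, values))
    for values in itertools.product([False, True], repeat=len(atoms))
]
Z = sum(math.exp(w * n_true_groundings(wd)) for wd in worlds)

def prob(world):
    """P(X = x) = exp(w * n(x)) / Z for the single-formula MLN."""
    return math.exp(w * n_true_groundings(world)) / Z
```

A world that violates one grounding of the formula is less probable than one that satisfies all groundings, but it is not impossible, which is the sense in which weights soften logical constraints.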
3.0  SEMANTIC  NETWORK  EXTRACTOR


3.1  System  Description 

Our  Semantic  Network  Extractor  (SNE)  system  [1]  jointly  clusters  objects  (entities)  and  relations 
(schemas/concepts)  in  an  unsupervised  manner,  without  requiring  the  number  of  clusters  to  be 
specified  in  advance.  SNE  does  so  by  allowing  information  from  object  clusters  it  has  created  at 
each step to be used in forming relation clusters, and vice versa. The object clusters and relation
clusters  respectively  form  the  nodes  and  links  of  a  semantic  network.  A  link  exists  between  two 
nodes  if  and  only  if  a  true  ground  fact  can  be  formed  from  the  symbols  in  the  corresponding 
relation  and  object  clusters. 

SNE  is  defined  using  finite  second-order  Markov  logic  in  which  variables  can  range  over 
relations  (predicates)  as  well  as  objects  (constants).  Extending  Markov  logic  to  second  order 
involves  simply  grounding  atoms  with  all  possible  predicate  symbols  as  well  as  all  constant 
symbols,  and  allows  us  to  represent  some  models  much  more  compactly  than  first-order  Markov 
logic. 

In SNE, we assume that relations are binary, i.e., relations are of the form r(x, y) where r is
a relation symbol, and x and y are object symbols. We use $\gamma_i$ and $\Gamma_i$ to
respectively denote a cluster and clustering (i.e., a partitioning) of symbols of type $i$. If r,
x, and y are respectively in clusters $\gamma_r$, $\gamma_x$, and $\gamma_y$, we say that r(x, y)
is in the cluster combination $(\gamma_r, \gamma_x, \gamma_y)$. The learning problem in SNE
consists of finding the cluster assignment $\Gamma = (\Gamma_r, \Gamma_x, \Gamma_y)$ that
maximizes the posterior probability
$P(\Gamma \mid D) \propto P(\Gamma, D) = P(\Gamma) P(D \mid \Gamma)$, where $D$ is a vector of
truth assignments to the observable r(x, y) ground atoms.
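
To make the notion of a cluster combination concrete, the following sketch counts the true and false ground atoms in each cluster combination that contains at least one true atom. The function name, the dictionary-based cluster assignment and the cluster labels are hypothetical; the sketch also assumes the closed-world assumption, so every ground atom not listed as true counts as false.

```python
from collections import Counter

def combination_counts(true_triples, cluster_r, cluster_x, cluster_y):
    """Return {(g_r, g_x, g_y): (t_k, f_k)} for combinations with a true atom.

    true_triples: set of distinct (r, x, y) triples observed as true.
    cluster_r / cluster_x / cluster_y: dicts mapping each symbol to a
    cluster label (hard assignment, one cluster per symbol).
    """
    size_r = Counter(cluster_r.values())
    size_x = Counter(cluster_x.values())
    size_y = Counter(cluster_y.values())

    # Number of true ground atoms falling in each cluster combination.
    true_counts = Counter(
        (cluster_r[r], cluster_x[x], cluster_y[y]) for r, x, y in true_triples
    )

    counts = {}
    for (gr, gx, gy), t in true_counts.items():
        # Every (r, x, y) drawn from the three clusters is a ground atom
        # of this combination; those not observed are false (closed world).
        total = size_r[gr] * size_x[gx] * size_y[gy]
        counts[(gr, gx, gy)] = (t, total - t)
    return counts

triples = {("named_after", "Jupiter", "Roman_god"), ("upheld", "Court", "ruling")}
cr = {"named_after": "R0", "upheld": "R1"}
cx = {"Jupiter": "X0", "Court": "X0"}
cy = {"Roman_god": "Y0", "ruling": "Y1"}
print(combination_counts(triples, cr, cx, cy))
```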

We define one MLN for the likelihood component $P(D \mid \Gamma)$ and one MLN for the prior
component $P(\Gamma)$ of the posterior probability, using just four simple rules in total.

The MLN for the likelihood component contains only one rule, stating that the truth value of
an atom is determined by the cluster combination it belongs to:

$\forall r, x, y, +\gamma_r, +\gamma_x, +\gamma_y \quad r \in \gamma_r \wedge x \in \gamma_x \wedge y \in \gamma_y \Rightarrow r(x, y)$

The “+” notation is syntactic sugar that signifies that there is an instance of this rule with a
separate weight for each cluster combination $(\gamma_r, \gamma_x, \gamma_y)$. This rule predicts
the probability of query atoms given the cluster memberships of the symbols in them. This is known
as the atom prediction rule.

Three rules are defined in the MLN for the prior component. The first rule states that each
symbol belongs to exactly one cluster:

$\forall x \; \exists! \gamma \quad x \in \gamma$

This rule is hard, i.e., it has infinite weight and cannot be violated.

The second rule imposes an exponential prior on the number of cluster combinations. This
rule combats the proliferation of cluster combinations and consequent overfitting, and is
represented by the formula

$\forall \gamma_r, \gamma_x, \gamma_y \; \exists r, x, y \quad r \in \gamma_r \wedge x \in \gamma_x \wedge y \in \gamma_y$

with negative weight $-\lambda$. The parameter $\lambda$ is fixed during learning, and is the
penalty in log-posterior incurred by adding a cluster combination to the model. Thus larger values
of $\lambda$ lead to fewer cluster combinations being formed. This rule represents the complexity
of the model in terms of the number of instances of the atom prediction rule (which is equal to
the number of cluster combinations).

The last rule encodes the belief that most symbols tend to be in different clusters. It is
represented by the formula

$\forall x, x', \gamma_x, \gamma_{x'} \quad x \in \gamma_x \wedge x' \in \gamma_{x'} \wedge x \neq x' \Rightarrow \gamma_x \neq \gamma_{x'}$

with positive weight $\mu$. The parameter $\mu$ is also fixed during learning. We expect there to
be many concepts and high-level relations in a large heterogeneous body of data. The tuple
extraction process samples instances of these concepts and relations sparsely, so we expect each
concept or relation to have only a few instances sampled, in many cases only one. Thus we
expect most pairs of symbols to be in different concept and relation clusters.

SNE  simplifies  the  learning  problem  by  performing  hard  assignment  of  symbols  to  clusters 
(i.e.,  instead  of  computing  probabilities  of  cluster  membership,  a  symbol  is  simply  assigned  to  its 
most  likely  cluster).  This  allows  the  maximum  a  posteriori  (MAP)  weights  of  the  atom  prediction 
rules,  and  the  MAP  log-posterior  to  be  computed  in  closed  form.  The  equation  for  the  log- 
posterior,  as  defined  by  the  two  MLNs,  can  be  written  in  closed  form  as 

$$\log P(\Gamma \mid D) = \sum_{k \in K} \left[ t_k \log \frac{t_k + \alpha}{t_k + f_k + \alpha + \beta} + f_k \log \frac{f_k + \beta}{t_k + f_k + \alpha + \beta} \right] - \lambda m + \mu d + C \qquad (1)$$

where $K$ is the set of cluster combinations; $t_k$ and $f_k$ are respectively the number of true
and false ground atoms in cluster combination $k$; $\alpha$ and $\beta$ are smoothing parameters;
$m$ is the number of cluster combinations; $d$ is the number of pairs of symbols that belong to
different clusters; and $C$ is a constant.
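
A minimal sketch of evaluating the closed-form log-posterior is given below. It assumes that each cluster combination contributes the smoothed terms t_k log((t_k + a) / (t_k + f_k + a + b)) + f_k log((f_k + b) / (t_k + f_k + a + b)), with a and b the smoothing parameters, and it drops the additive constant C. The function name and default parameter values are hypothetical.

```python
import math

def log_posterior(counts, d, lam, mu, alpha=1.0, beta=1.0):
    """Log-posterior of a clustering, up to the additive constant C.

    counts: list of (t_k, f_k) pairs, one per cluster combination.
    d:      number of symbol pairs that lie in different clusters.
    lam:    penalty per cluster combination (weight of the second prior rule).
    mu:     reward per pair of symbols in different clusters.
    """
    total = 0.0
    for t, f in counts:
        denom = t + f + alpha + beta
        total += t * math.log((t + alpha) / denom)
        total += f * math.log((f + beta) / denom)
    m = len(counts)  # number of cluster combinations
    return total - lam * m + mu * d

# One combination with a single true atom and no false atoms.
print(log_posterior([(1, 0)], d=0, lam=0.0, mu=0.0))
```

Because the score depends on the clustering only through the counts, m and d, a candidate merge can be evaluated by updating those quantities rather than re-running inference.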
Since the log-posterior can be computed in closed form, SNE simply searches over cluster
assignments, evaluating each assignment by its posterior probability. (To speed up the
computation of Equation 1, we make an approximation to it. Please refer to [1] for details.) SNE
uses a bottom-up agglomerative clustering algorithm to find the MAP clustering. The algorithm
begins by assigning each symbol to its own unit cluster. Next we try to merge pairs of clusters of
each type. We create candidate pairs of clusters, and for each of them, we evaluate the change in
posterior probability (Equation 1) if the pair is merged. If the candidate pair improves posterior
probability, we store it in a sorted list. We then iterate through the list, performing the best
merges first, and ignoring those containing clusters that have already been merged. In this
manner, we incrementally merge clusters until no merges can be performed to improve posterior
probability. To avoid creating all possible candidate pairs of clusters of each type (which is
quadratic in the number of clusters), we make use of canopies [10]. A canopy for relation
symbols is a set of clusters such that there exist object clusters $\gamma_x$ and $\gamma_y$, and
for all clusters $\gamma_r$ in the canopy, the cluster combination
$(\gamma_r, \gamma_x, \gamma_y)$ contains at least one true ground atom r(x, y). We say that the
clusters in the canopy share the property $(\gamma_x, \gamma_y)$. Canopies for object symbols x
and y are similarly defined. We only try to merge clusters in a canopy that is no larger than a
parameter CanopyMax. This parameter limits the number of candidate cluster pairs we consider
for merges, making our algorithm more tractable. Furthermore, by using canopies, we only try
“good” merges, because symbols in clusters that share a property are more likely to belong to the
same cluster than those in clusters with no property in common.
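
The merge loop described above can be sketched as follows. This is a simplified illustration rather than SNE's implementation: it considers all pairs of clusters instead of restricting candidates to canopies, it handles a single symbol type, and the scoring function is a hypothetical stand-in for the log-posterior.

```python
import itertools

def agglomerate(symbols, score):
    """Greedy bottom-up merging.

    score maps a clustering (a set of frozensets) to a number to maximize.
    """
    clusters = {frozenset([s]) for s in symbols}  # start from unit clusters
    while True:
        base = score(clusters)
        gains = []
        for a, b in itertools.combinations(clusters, 2):
            merged = (clusters - {a, b}) | {a | b}
            gain = score(merged) - base
            if gain > 0:  # keep only merges that improve the score
                gains.append((gain, a, b))
        if not gains:
            return clusters
        # Best merges first; skip pairs whose clusters were already merged.
        gains.sort(key=lambda g: g[0], reverse=True)
        used = set()
        for _, a, b in gains:
            if a in used or b in used:
                continue
            clusters = (clusters - {a, b}) | {a | b}
            used.update([a, b])

def toy_score(clusters):
    """Hypothetical score: +1 for each same-initial pair in a cluster, -1 otherwise."""
    total = 0
    for cluster in clusters:
        for s, t in itertools.combinations(sorted(cluster), 2):
            total += 1 if s[0] == t[0] else -1
    return total

result = agglomerate(["apple", "avocado", "banana", "blueberry"], toy_score)
print(result)
```

Restricting the inner loop to pairs drawn from the same canopy would replace the quadratic enumeration with a much smaller candidate set.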

3.2  Experiments 

We  conducted  experiments  to  investigate  the  efficacy  of  jointly  clustering  relations  and  objects 
vis-a-vis  clustering  them  separately  (i.e.,  clustering  relations  but  not  objects,  and  vice  versa).  We 
also  investigated  the  effectiveness  of  SNE  against  three  other  relational  clustering  systems,  viz., 
Multiple  Relational  Clusterings  (MRC),  Information-Theoretic  Co-clustering  (ITC),  and  Infinite 
Relational  Model  (IRM). 

All experiments were conducted on a large Web dataset consisting of 2.1 million r(x, y)
triples (publicly available at http://knight.cis.temple.edu/~yates/data/resolver_data.tar.gz)
extracted in a Web crawl by the information extraction system TextRunner [11]. Each triple
takes the form r(x, y) where r is a relation symbol, and x and y are object symbols. Some
example triples are: named_after(Jupiter, Roman_god) and upheld(Court, ruling). There are
15,872 distinct r symbols, 700,781 distinct x symbols, and 665,378 distinct y symbols. Two
characteristics of TextRunner’s extractions are that they are sparse and noisy. To reduce the noise
in the dataset, we only considered symbols that appeared at least 25 times. This leaves 10,214 r
symbols, 8,942 x symbols, and 7,995 y symbols. There are 2,065,045 triples that contain at least
one symbol that appears at least 25 times. In all experiments, we set the CanopyMax parameter
to 50. We also made the closed-world assumption for all systems (i.e., all triples not in the
dataset are assumed false). Because the other relational clustering systems do not scale to the
Web dataset, we had to modify them to use SNE’s search algorithm. We also limited MRC to
find a single clustering (it is able to find multiple) for an apples-to-apples comparison with SNE.
We evaluated the clusterings learned by each model against a gold standard that we manually
created. The gold standard assigns 2,688 r symbols, 2,568 x symbols, and 3,058 y symbols to 874,
511, and 700 non-unit clusters respectively. We measured the pairwise precision, recall and F1
of each model against the gold standard. Pairwise precision is the fraction of symbol pairs in
learned clusters that appear in the same gold clusters. Pairwise recall is the fraction of symbol
pairs in gold clusters that appear in the same learned clusters. F1 is the harmonic mean of
precision and recall.
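
The pairwise measures defined above can be computed as in the following sketch. The function names are hypothetical, and the sketch assumes that the learned and gold clusterings cover the same symbols; restricting the learned clusters to symbols that appear in the gold standard is left out.

```python
from itertools import combinations

def same_cluster_pairs(clusters):
    """All unordered symbol pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs

def pairwise_scores(learned, gold):
    """Return (precision, recall, F1) of learned clusters against gold clusters."""
    learned_pairs = same_cluster_pairs(learned)
    gold_pairs = same_cluster_pairs(gold)
    common = len(learned_pairs & gold_pairs)
    precision = common / len(learned_pairs) if learned_pairs else 0.0
    recall = common / len(gold_pairs) if gold_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

learned = [{"a", "b", "c"}]
gold = [{"a", "b"}, {"c"}]
p, r, f1 = pairwise_scores(learned, gold)
print(p, r, f1)
```

In this toy case the single learned cluster over-merges, so recall is perfect while precision drops to one third.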

Figure 1 shows a snippet of the semantic network learned by SNE. Table 1 shows the
performance of SNE when it clusters relations and objects jointly and when it clusters them
separately (SNE-Sep). From that table, we can see that SNE has the better overall F1 when it
clusters relations and objects jointly. We show the best F1s in bold. Table 2 compares the
performance of SNE to those of three other relational clustering systems, and shows that SNE
has the best overall F1 score. From Table 3, which shows the runtimes of the various systems, we
see that SNE scales well relative to the other systems. We also evaluated the systems in terms of
the semantic statements that they learned, where a semantic statement is a cluster combination
with one true ground atom. We found that SNE outperforms the other systems in terms of the
fraction of correct semantic statements discovered (see [1] for details). We also found that the
clusters discovered by SNE agree well with those in a publicly available ontology, WordNet [12].

Table 1. Performance when SNE Clusters Relations and Objects Jointly and
Separately (SNE-Sep)

            Relation                       Object
Systems     Precision  Recall  F1          Precision  Recall  F1
SNE         0.452      0.187   0.265       0.509      0.062   0.110
SNE-Sep     0.597      0.116   0.194       0.535      0.046   0.085

Table 2. Performance of SNE and Three Other Relational Clustering Systems

            Relation                       Object
Systems     Precision  Recall  F1          Precision  Recall  F1
SNE         0.452      0.187   0.265       0.509      0.062   0.110
IRM         0.201      0.089   0.124       0.280      0.042   0.073
ITC         0.773      0.003   0.006       0.617      0.025   0.048
MRC         0.054      0.044   0.049       0.045      0.009   0.015

Table 3. Runtimes of SNE and Three Other Relational Clustering Systems

Systems     Runtimes (hrs)
SNE         5.5
IRM         9.5
ITC         72.0
Figure  1:  Snippet  of  Semantic  Network  Learned  by  SNE


4.0  JOINT  UNSUPERVISED  COREFERENCE  RESOLUTION 
4.1  System  Description

In  this  system,  we  demonstrate  how  we  can  easily  add  background  knowledge  using  Markov 
logic  to  improve  the  matching  of  entities.  We  tested  the  efficacy  of  our  system  on  the  problem  of 
coreference  resolution,  i.e.,  identifying  mentions  (typically  noun  phrases)  that  refer  to  the  same 
entities.  This  is  a  key  sub-problem  in  many  natural  language  processing  (NLP)  applications, 
including  information  extraction,  question  answering,  machine  translation,  etc. 

Supervised learning approaches treat the problem as one of classification: for each pair of
mentions, predict whether they corefer or not [13]. While successful, these approaches require
labeled training data, consisting of mention pairs and the correct decisions for them. This limits
their applicability. Unsupervised approaches are attractive due to the availability of large
quantities of unlabeled text. However, unsupervised coreference resolution is much more
difficult. The most sophisticated model to date, proposed by [14], still lags supervised ones by a
substantial margin. The lack of label information in unsupervised coreference resolution can
potentially be overcome by performing joint inference, which leverages the “easy” decisions to
help make related “hard” ones. Relations that have been exploited in supervised coreference
resolution include transitivity and anaphoricity. (Transitivity refers to the condition where if
mentions A and B corefer, and B and C corefer, then A and C corefer. Anaphoricity refers to the
condition where a linguistic unit (e.g., pronoun) refers back to another unit, as in the use of him
to refer to Alan in the sentence Alan told Betty to get him some candy.) However, there is little
work to date on joint inference for unsupervised resolution. We address this problem using Markov
logic, which allows us to easily build models involving relations among mentions, like
apposition and predicate nominals. By extending the state-of-the-art algorithms for inference
and learning in Markov logic, we developed the first general-purpose unsupervised learning
algorithm for Markov logic, and applied it to unsupervised coreference resolution.

We  incrementally  create  more  sophisticated  MLNs  for  coreference  resolution  to  illustrate  the 
ease  of  specifying  models  in  Markov  logic. 

4.1.1  Base  MLN 

The main query predicate is InClust(m, c!), which is true if and only if mention m is in cluster
c. (A query predicate is a predicate whose value we do not know at test time and would like to
infer.) The “c!” notation signifies that for each m, this predicate is true for a unique value of c.
The main evidence predicate is Head(m, t!), where m is a mention and t a token, and which is
true if and only if t is the head of m. A key component in our MLN is a simple head mixture
model, where the mixture component priors are represented by the unit clause InClust(+m, +c)
and the head distribution is represented by the head prediction rule

$\mathrm{InClust}(m, +c) \wedge \mathrm{Head}(m, +t)$.
All free variables are implicitly universally quantified. The “+” notation signifies that the MLN
contains  an  instance  of  the  rule,  with  a  separate  weight,  for  each  value  combination  of  the 
variables  with  a  plus  sign.  By  convention,  at  each  inference  step  we  name  each  non-empty 
cluster  after  the  earliest  mention  it  contains.  This  helps  break  the  symmetry  among  mentions, 
which  otherwise  produces  multiple  optima  and  makes  learning  unnecessarily  harder.  To 
encourage  clustering,  we  impose  an  exponential  prior  on  the  number  of  non-empty  clusters  with 
weight  -1.  The  above  model  only  clusters  mentions  with  the  same  head,  and  does  not  work  well 
for  pronouns.  To  address  this,  we  introduce  the  predicate  IsPrn(m),  which  is  true  if  and  only  if 
the  mention  m  is  a  pronoun,  and  adapt  the  head  prediction  rule  as  follows: 

$\neg \mathrm{IsPrn}(m) \wedge \mathrm{InClust}(m, +c) \wedge \mathrm{Head}(m, +t)$

This is always false when m is a pronoun, and thus applies only to non-pronouns. Pronouns tend
to resolve with mentions that are semantically compatible with them. Thus we introduce
predicates that represent entity type, number, and gender: Type(x, e!), Number(x, n!),
Gender(x, g!), where x can be either a cluster or mention,
$n \in \{\mathrm{Singular}, \mathrm{Plural}\}$,
$e \in \{\mathrm{Person}, \mathrm{Organization}, \mathrm{Location}, \mathrm{Other}\}$, and
$g \in \{\mathrm{Male}, \mathrm{Female}, \mathrm{Neuter}\}$. Many of these are known for
pronouns, and some can be inferred from simple linguistic cues (e.g., “Ms. Galen” is a singular
female person, while “XYZ Corp.” is an organization). (We used the following cues: Mr., Ms., Jr.,
Inc., Corp., corporation, and company.) Entity type assignment is represented by the unit clause
Type(+x, +e), and similarly for number and gender. A mention should agree with its cluster in
entity type. This is ensured by the hard rule (which has infinite weight and must be satisfied)


$\mathrm{InClust}(m, c) \Rightarrow (\mathrm{Type}(m, e) \Leftrightarrow \mathrm{Type}(c, e))$

There  are  similar  hard  rules  for  number  and  gender. 

Different  pronouns  prefer  different  entity  types,  as  represented  by 

$\mathrm{IsPrn}(m) \wedge \mathrm{InClust}(m, c) \wedge \mathrm{Head}(m, +t) \wedge \mathrm{Type}(c, +e)$

which  only  applies  to  pronouns,  and  whose  weight  is  positive  if  pronoun  t  is  likely  to  assume 
entity  type  e  and  negative  otherwise.  There  are  similar  rules  for  number  and  gender.  Aside  from 
semantic  compatibility,  pronouns  tend  to  resolve  with  nearby  mentions.  To  model  this,  we 
impose  an  exponential  prior  on  the  distance  (number  of  mentions)  between  a  pronoun  and  its 
antecedent, with weight -1.
4.1.2  Full  MLN


Syntactic  relations  among  mentions  often  suggest  coreference.  Incorporating  such  relations  into 
our  MLN  is  straightforward.  We  illustrate  this  with  two  examples:  apposition  and  predicate 
nominals. We introduce a predicate for apposition, Appo(x, y), where x, y are mentions, and
which is true if and only if y is an appositive of x. We then add the rule

$\mathrm{Appo}(x, y) \Rightarrow (\mathrm{InClust}(x, c) \Leftrightarrow \mathrm{InClust}(y, c))$

which ensures that x, y are in the same cluster if y is an appositive of x. Similarly, we introduce
a predicate for predicate nominals, PredNom(x, y), and the corresponding rule. The weights of
both rules can be learned from data with a positive prior mean. For simplicity, in this report we
treat them as hard constraints.
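
When the apposition and predicate-nominal rules are treated as hard constraints, every linked pair of mentions must share a cluster, and chains of links force larger groups. The sketch below computes those forced groups with a union-find structure. It is only an illustration of the constraint's effect, not the inference procedure used by the system, and the function name and mention labels are hypothetical.

```python
def forced_groups(mentions, links):
    """Group mentions that hard Appo/PredNom links force into one cluster.

    mentions: list of mention identifiers.
    links:    iterable of (x, y) pairs for which Appo(x, y) or
              PredNom(x, y) holds.
    """
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the representative, halving the path.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for x, y in links:
        parent[find(x)] = find(y)  # merge the two groups

    groups = {}
    for m in mentions:
        groups.setdefault(find(m), set()).add(m)
    return list(groups.values())

groups = forced_groups(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
print(groups)
```

Any clustering that splits one of these groups violates a hard rule and therefore has zero probability under the model.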


4.1.3  Extensions  to  Weight  Learning  and  Inference 

In  order  to  apply  existing  Markov  logic  inference  and  learning  algorithms  to  the  problem  of 
unsupervised  coreference  resolution,  we  had  to  extend  them.  Unsupervised  learning  in  Markov 
logic  maximizes  the  conditional  log-likelihood 


$$L(x, y) = \log P(Y = y \mid X = x) = \log \sum_z P(Y = y, Z = z \mid X = x)$$


where  Z  are  unknown  predicates.  In  our  coreference  resolution  MLN,  Y  includes  Head  and 
known  groundings  of  Type,  Number  and  Gender;  Z  includes  InClust  and  unknown 
groundings  of  Type,  Number,  Gender;  and  X  includes  IsPrn,  Appo  and  PredNom.  (For 
simplicity,  from  now  on  we  drop  x  from  the  formula.)  With  Z,  the  optimization  problem  is  no 
longer  convex.  However,  we  can  still  find  a  local  optimum  using  gradient  descent,  with  the 
gradient  being 


$$\frac{\partial}{\partial w_i} L(y) = E_{Z \mid y}[n_i] - E_{Y, Z}[n_i]$$

where $n_i$ is the number of true groundings of the $i$th clause. We extended PSCG
(preconditioned scaled conjugate gradient) for unsupervised learning. The gradient is the
difference of two expectations, each of which can be approximated using samples generated by
MC-SAT [15]. The $(i, j)$th entry of the Hessian is now

$$\frac{\partial^2}{\partial w_i \, \partial w_j} L(y) = \mathrm{Cov}_{Z \mid y}[n_i, n_j] - \mathrm{Cov}_{Y, Z}[n_i, n_j]$$

and  the  step  size  can  be  computed  accordingly.  Since  our  problem  is  no  longer  convex,  the 
negative  diagonal  Hessian  may  contain  zero  or  negative  entries,  so  we  first  took  the  absolute 
values  of  the  diagonal  and  added  1,  then  used  the  inverse  as  the  preconditioner.  Notice  that  when 
the  objects  form  independent  subsets  (in  our  cases,  mentions  in  each  document),  we  can  process 
them  in  parallel  and  then  gather  sufficient  statistics  for  learning.  We  developed  an  efficient 
parallelized implementation of our unsupervised learning algorithm using the message-passing
interface (MPI). To reduce burn-in time, we initialized MC-SAT with the state returned by
MaxWalkSAT [16], rather than a random solution to the hard clauses. In the existing
implementation in Alchemy [17], SampleSAT [18] flips only one atom in each step, which is
inefficient for predicates with unique-value constraints (e.g., Head(m, c!)). Such predicates can
be viewed as multi-valued predicates (e.g., Head(m) with value ranging over all c’s) and are
prevalent in NLP applications. We adapted SampleSAT to flip two or more atoms in each step so
that the unique-value constraints are automatically satisfied. By default, MC-SAT treats each
ground clause as a separate factor while determining the slice. This can be very inefficient for
highly correlated clauses. For example, given a non-pronoun mention m currently in cluster c
and with head t, among the mixture prior rules involving m, InClust(m, c) is the only one that is
satisfied, and among those head-prediction rules involving m,
$\neg \mathrm{IsPrn}(m) \wedge \mathrm{InClust}(m, c) \wedge \mathrm{Head}(m, t)$ is the only one
that is satisfied; the factors for these rules multiply to
$\phi = \exp(w_{m,c} + w_{m,c,t})$, where $w_{m,c}$ is the weight for InClust(m, c), and
$w_{m,c,t}$ is the weight for
$\neg \mathrm{IsPrn}(m) \wedge \mathrm{InClust}(m, c) \wedge \mathrm{Head}(m, t)$, since an
unsatisfied rule contributes a factor of $e^0 = 1$. We extended MC-SAT to treat each set of
mutually exclusive and exhaustive rules as a single factor. E.g., for the above m, MC-SAT now
samples $u$ uniformly from $(0, \phi)$, and requires that in the next state $\phi'$ be no less
than $u$. Equivalently, the new cluster and head for m should satisfy
$w_{m,c'} + w_{m,c',t'} \geq \log(u)$. We extended SampleSAT so that when it considers
flipping  any  variable  involved  in  such  constraints  (e.g.,  c  or  t  above),  it  ensures  that  their  new 
values  still  satisfy  these  constraints.  The  final  clustering  is  found  using  the  MaxWalkSAT 
weighted  satisfiability  solver,  with  the  appropriate  extensions.  We  first  ran  a  MaxWalkSAT  pass 
with  only  finite-weight  formulas,  then  ran  another  pass  with  all  formulas.  We  found  that  this 
significantly  improved  the  quality  of  the  results  that  MaxWalkSAT  returned. 
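
The single-factor slice step described above can be sketched for one mention as follows. This is an illustrative fragment, not the Alchemy implementation: the function name, the dictionary layout of the weights and the example cluster and head values are all hypothetical. The code checks the constraint in the equivalent form exp(w) >= u rather than w >= log(u).

```python
import math
import random

def slice_candidates(w_prior, w_head, current, rng=random):
    """Sample the slice variable u for one mention and list allowed next states.

    w_prior[c]:     weight of the mixture prior rule for cluster c.
    w_head[(c, t)]: weight of the head prediction rule for cluster c, head t.
    current:        the mention's current (cluster, head) pair.
    """
    c, t = current
    # Combined factor of the two satisfied rules for the current state.
    phi = math.exp(w_prior[c] + w_head[(c, t)])
    # Draw u from (0, phi]; 1 - random() lies in (0, 1], so u is never 0.
    u = phi * (1.0 - rng.random())
    # A state (c', t') stays in the slice if its combined factor is >= u,
    # i.e. w_prior[c'] + w_head[(c', t')] >= log(u).
    allowed = [
        (c2, t2)
        for (c2, t2) in w_head
        if math.exp(w_prior[c2] + w_head[(c2, t2)]) >= u
    ]
    return u, allowed

rng = random.Random(0)
w_prior = {"c1": 0.5, "c2": -2.0}
w_head = {("c1", "Smith"): 1.0, ("c2", "Smith"): -1.0}
u, allowed = slice_candidates(w_prior, w_head, ("c1", "Smith"), rng)
print(u, allowed)
```

The current state always remains in the slice because u never exceeds its factor, so the sampler can never get stuck with an empty set of candidates.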

4.2  Experiments 

We tested our approach on MUC-6, ACE-2004 and ACE Phase-2 (ACE-2). The MUC-6 dataset
consists  of  30  documents  for  testing  and  221  for  training.  The  English  version  of  the  ACE-2004 
training  corpus  contains  two  sections,  BNEWS  and  NWIRE,  with  220  and  128  documents, 
respectively.  ACE-2  contains  a  training  set  and  a  test  set.  In  our  experiments,  we  only  used  the 
test  set,  which  contains  three  sections,  BNEWS,  NWIRE,  and  NPAPER,  with  51,  29,  and  17 
documents,  respectively.  We  emphasize  that  our  approach  is  unsupervised,  and  thus  the  data  only 
contains raw text plus true mention boundaries. We evaluated our systems using two commonly
used scoring programs: MUC [19] and B³ [20]. On MUC-6, we compared against the published
results  of  the  state-of-the-art  unsupervised  system  by  [14]  (H&K),  and  against  the  state-of-the-art 
supervised  system  by  [13]  (M&W).  On  ACE-2004,  we  compared  against  the  published  results  of 
H&K.  On  ACE-2,  we  compared  against  the  published  results  of  two  supervised  systems  [21] 
(Ng)  and  [22]  (D&B). 
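
For readers unfamiliar with B³, the following Python sketch computes it in its standard form, with every mention weighted equally. It is not the scoring program used in our experiments; the function name and the toy clusters are hypothetical, and it assumes that the key and the response cover the same mentions (as is the case with true mention boundaries).

```python
def b_cubed(key_clusters, response_clusters):
    """Return (precision, recall, F1) under B-cubed with equal mention weights.

    Both arguments are lists of disjoint sets of mention ids over the same mentions.
    """
    key_of = {m: cluster for cluster in key_clusters for m in cluster}
    resp_of = {m: cluster for cluster in response_clusters for m in cluster}
    mentions = list(key_of)
    # Per-mention precision and recall are based on the overlap of its two clusters.
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m]) for m in mentions) / len(mentions)
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m]) for m in mentions) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# Toy example: five mentions, two true entities, one mention placed in the wrong cluster.
key = [{1, 2, 3}, {4, 5}]
response = [{1, 2}, {3, 4, 5}]
print(b_cubed(key, response))
```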

Table 4 shows the results on the MUC-6 and ACE-2004 datasets. Our approach (MLN)
outperforms both H&K and M&W in precision, recall and F1. Tables 5 and 6 show the results on
the ACE-2 dataset. Our approach outperforms Ng and is competitive with D&B on all measures.

Table 4. Coreference Results in MUC Scores on the MUC-6 and ACE-2004 datasets

          MUC-6                      ACE-2004 EN-BNEWS          ACE-2004 EN-NWIRE
Systems   Precision  Recall  F1      Precision  Recall  F1      Precision  Recall  F1
H&K       80.4       62.4    70.3    63.2       61.3    62.3    66.7       62.3    64.2
M&W       -          -       73.4    -          -       -       -          -       -
MLN       83.0       75.8    79.2    66.8       67.8    67.3    71.3       70.5    70.9

Table 5. Coreference Results in MUC Scores on the ACE-2 datasets

          BNEWS                      NWIRE                      NPAPER
Systems   Precision  Recall  F1      Precision  Recall  F1      Precision  Recall  F1
Ng        67.9       62.2    64.9    60.3       50.1    54.7    71.4       67.4    69.3
D&B       78.0       62.1    69.2    75.8       60.8    67.5    77.6       68.0    72.5
MLN       68.3       66.6    67.4    67.7       67.3    67.4    69.2       71.7    70.4

Table 6. Coreference Results in B³ Scores on the ACE-2 datasets

          BNEWS                      NWIRE                      NPAPER
Systems   Precision  Recall  F1      Precision  Recall  F1      Precision  Recall  F1
Ng        77.1       57.0    65.6    75.4       59.3    66.4    75.4       59.3    66.4
MLN       70.3       65.3    67.7    74.7       68.8    71.6    70.0       66.5    68.2

5.0 LEARNING MLN STRUCTURE VIA HYPERGRAPH LIFTING


5.1  System  Description 

We created the Learning via Hypergraph Lifting (LHL) system [3] to learn background
knowledge  (in  the  form  of  Markov  logic  rules)  from  data  when  it  is  not  known  a  priori.  Such 
knowledge  could  then  be  used  for  matching  entities,  schemas  and  concepts  (as  in  the  previous 
system). 

Learning  Markov  logic  rules  and  their  associated  weights  is  the  problem  of  MLN  Structure 
Learning.  To  date,  most  MLN  structure  learners  [23,  24]  systematically  enumerate  candidate 
clauses  by  starting  from  an  empty  clause,  greedily  adding  literals  to  it,  and  testing  the  resulting 
clause's  empirical  fit  to  training  data.  Such  a  strategy  has  two  shortcomings:  searching  the  large 
space  of  clauses  is  computationally  expensive;  and  it  is  susceptible  to  converging  to  a  local 
optimum,  missing  potentially  useful  clauses.  These  shortcomings  can  be  ameliorated  by  using 
the  data  to  a  priori  constrain  the  space  of  candidates.  This  is  the  basic  idea  in  relational 
pathfinding  [25],  which  finds  paths  of  true  ground  atoms  that  are  linked  via  their  arguments  and 
then  generalizes  them  into  first-order  rules.  Each  path  corresponds  to  a  conjunction  that  is  true  at 
least  once  in  the  data.  Since  most  conjunctions  are  false,  this  helps  to  concentrate  the  search  on 
regions  with  promising  rules.  However,  pathfinding  potentially  amounts  to  exhaustive  search 
over  an  exponential  number  of  paths.  Hence,  systems  using  relational  pathfinding  typically 
restrict  themselves  to  very  short  paths,  creating  short  clauses  from  them  and  greedily  joining 
them  into  longer  ones. 

Our  system  LHL  uses  relational  pathfinding  to  a  fuller  extent  than  previous  ones.  It  mitigates 
the  exponential  search  problem  by  first  inducing  a  more  compact  representation  of  data,  in  the 
form  of  a  hypergraph  over  clusters  of  constants.  Pathfinding  on  this  ‘lifted’  hypergraph  is 
typically  at  least  an  order  of  magnitude  faster  than  on  the  ground  training  data,  and  produces 
MLNs  that  are  more  accurate. 

A  hypergraph  is  a  straightforward  generalization  of  a  graph  in  which  an  edge  can  link  any 
number  of  nodes,  rather  than  just  two.  More  formally,  we  define  a  hypergraph  as  a  pair  (V,  E) 
where  V  is  a  set  of  nodes,  and  E  is  a  multiset  of  labeled  non-empty  ordered  subsets  of  V  called 
hyperedges. In LHL, we find paths in a hypergraph. A path is defined as a set of hyperedges such
that for any two hyperedges $e_0$ and $e_n$ in the set, there exists an ordering of (a subset of)
hyperedges in the set $e_0, e_1, \ldots, e_{n-1}, e_n$ such that $e_i$ and $e_{i+1}$ share at least one node.
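
The path definition amounts to requiring that the set of hyperedges be connected through shared nodes. A minimal Python sketch of this check follows; the representation of a hyperedge as a (label, nodes) tuple, the function name, and the example constants are assumptions made for illustration.

```python
from collections import deque

def is_path(hyperedges):
    """Return True if every two hyperedges are linked by a chain of hyperedges
    in which consecutive ones share at least one node."""
    edges = list(hyperedges)
    if not edges:
        return False  # choice of this sketch: the empty set is not treated as a path
    seen = {0}
    queue = deque([0])
    while queue:  # breadth-first search over hyperedges that share a node
        i = queue.popleft()
        for j, (_, nodes) in enumerate(edges):
            if j not in seen and set(edges[i][1]) & set(nodes):
                seen.add(j)
                queue.append(j)
    return len(seen) == len(edges)

# Hypothetical ground atoms; each hyperedge is (predicate label, ordered nodes).
triangle = [
    ("Advises", ("Pete", "Sam")),
    ("TAs", ("Sam", "CS1")),
    ("Teaches", ("Pete", "CS1")),
]
print(is_path(triangle))                                        # connected set
print(is_path([triangle[0], ("Teaches", ("Ann", "CS2"))]))      # no shared node
```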

A  database  can  be  viewed  as  a  hypergraph  with  constants  as  nodes,  and  true  ground  atoms  as 
hyperedges.  Each  hyperedge  is  labeled  with  a  predicate  symbol.  Nodes  (constants)  are  linked  by 
a  hyperedge  (true  ground  atom)  if  and  only  if  they  appear  as  arguments  in  the  hyperedge. 
(Henceforth we use node and constant interchangeably, and likewise for hyperedge and true
ground atom.) A path of hyperedges can be generalized into a first-order clause by variabilizing
their  arguments.  To  avoid  tracing  the  exponential  number  of  paths  in  the  hypergraph,  LHL  first 
jointly  clusters  the  nodes  into  higher-level  concepts,  and  by  doing  so  it  also  clusters  the 
hyperedges  (i.e.,  the  ground  atoms  containing  the  clustered  nodes).  The  ‘lifted’  hypergraph  has 
fewer nodes and hyperedges, and therefore fewer paths, reducing the cost of finding them.

Figure 2 provides an example. We have a database describing an academic department where
professors  tend  to  have  students  whom  they  are  advising  as  teaching  assistants  (TAs)  in  the 
classes  the  professors  are  teaching.  The  left  graph  is  created  from  the  database,  and  after  lifting, 
results  in  the  right  graph.  Observe  that  the  lifted  graph  is  simpler  and  the  clustered  constants 
correspond  to  the  high-level  concepts  of  Professor,  Student,  and  Course. 
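
The effect of lifting can be reproduced on a toy version of this example. In the Python sketch below, the constants, the predicate names, and the hand-written cluster assignment are all hypothetical; LHL learns the clusters from data, whereas here they are simply given in order to show how ground hyperedges collapse.

```python
from collections import Counter

# Hypothetical true ground atoms for a small department.
atoms = [
    ("Teaches", ("Pete", "CS1")),
    ("Teaches", ("Paul", "CS2")),
    ("Advises", ("Pete", "Sam")),
    ("Advises", ("Paul", "Sara")),
    ("TAs", ("Sam", "CS1")),
    ("TAs", ("Sara", "CS2")),
]

def ground_hypergraph(atoms):
    """Nodes are constants; hyperedges are the true ground atoms (a multiset)."""
    nodes = {c for _, args in atoms for c in args}
    return nodes, Counter(atoms)

def lift(atoms, cluster_of):
    """Replace each constant by its cluster; identical lifted atoms collapse."""
    lifted = Counter()
    for pred, args in atoms:
        lifted[(pred, tuple(cluster_of[c] for c in args))] += 1
    return set(cluster_of.values()), lifted

# Cluster assignment written by hand for illustration only.
cluster_of = {
    "Pete": "Professor", "Paul": "Professor",
    "Sam": "Student", "Sara": "Student",
    "CS1": "Course", "CS2": "Course",
}

nodes, edges = ground_hypergraph(atoms)
lifted_nodes, lifted_edges = lift(atoms, cluster_of)
print(len(nodes), len(edges))                # 6 nodes, 6 hyperedges
print(len(lifted_nodes), len(lifted_edges))  # 3 nodes, 3 hyperedges
```

Each lifted hyperedge keeps a count of the ground atoms it covers, so every lifted hyperedge contains at least one true ground atom.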


Figure  2:  Example  of  Hypergraph  Lifting 


LHL consists of three steps. It begins by lifting a hypergraph. Then it finds paths in the
lifted hypergraph. Finally, it creates candidate clauses from the paths and learns their weights to
create an MLN. We describe each step in turn.

5.1.1  Hypergraph  Lifting 

We call our hypergraph lifting algorithm LiftGraph. LiftGraph is defined using Markov logic
rules similar to those of SNE. It differs from SNE in the following ways. LiftGraph can handle
relations of arbitrary arity, whereas SNE can only handle binary relations. While SNE can cluster
relation symbols, in this report, for simplicity, LiftGraph does not cluster relations. (However, it is
straightforward  to  extend  LiftGraph  to  do  so.)  LiftGraph  works  by  jointly  clustering  the 
constants  in  a  hypergraph  in  a  bottom-up  agglomerative  manner,  allowing  information  to 
propagate  from  one  cluster  to  another  as  they  are  formed.  The  number  of  clusters  need  not  be 
pre-specified.  As  a  consequence  of  clustering  the  constants,  the  ground  atoms  in  which  the 
constants  appear  are  also  clustered.  Each  hyperedge  in  the  lifted  hypergraph  contains  at  least  one 
true  ground  atom. 

We use the same notation as SNE. In addition, we use $r(y_1, \ldots, y_n)$ to denote a hyperedge
connecting nodes $y_1, \ldots, y_n$. A hypergraph representing the true ground atoms
$r(x_1, \ldots, x_n)$ in a database is simply $(V = \{\{x_i\}\}, E = \{r(\{x_1\}, \ldots, \{x_n\})\})$
with each constant $x_i$ in its own cluster, and a hyperedge for each true ground atom.

LiftGraph  simplifies  the  learning  problem  by  performing  hard  assignment  of  constant 
symbols  to  clusters  (like  SNE).  The  log-posterior  of  the  LiftGraph  model  can  now  be  computed 
in  closed  form.  LiftGraph  thus  simply  searches  over  cluster  assignments,  evaluating  each  one  by 
its posterior probability. It begins by assigning each constant symbol $x_i$ to its own cluster
$\{x_i\}$, and creating a hyperedge $r(\{x_1\}, \ldots, \{x_n\})$ for each true ground atom
$r(x_1, \ldots, x_n)$. Next it creates candidate pairs of clusters of each type, and for each pair,
it evaluates the gain in posterior probability if its clusters are merged. It then chooses the pair
that gives the largest gain to be merged. When clusters $y_i$ and $y_i'$ are merged to form
$y_i^{new}$, each hyperedge $r(y_1, \ldots, y_i, \ldots, y_n)$ is replaced with
$r(y_1, \ldots, y_i^{new}, \ldots, y_n)$ (and similarly for hyperedges containing $y_i'$). Since
$r(y_1, \ldots, y_i, \ldots, y_n)$ contains at least one true ground atom,
$r(y_1, \ldots, y_i^{new}, \ldots, y_n)$ must too. In this manner, LiftGraph incrementally merges
clusters until no merges can be performed to improve posterior probability. It then returns a
lifted hypergraph whose hyperedges all contain at least one true ground atom.
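
The greedy merging loop can be sketched in Python as follows. This is a rough illustration under strong assumptions: the gain function below (`toy_gain`) is a hypothetical stand-in based on the overlap of argument slots, not the LiftGraph posterior, and clusters are untyped. Only the merge operation and the stopping rule follow the description above.

```python
from collections import Counter
from itertools import combinations

def merge(edges, a, b):
    """Replace clusters a and b by their union in every hyperedge."""
    new = a | b
    merged = Counter()
    for (pred, args), count in edges.items():
        merged[(pred, tuple(new if c in (a, b) else c for c in args))] += count
    return merged

def signature(edges, cluster):
    """(predicate, argument position) slots in which a cluster occurs."""
    return {(p, i) for p, args in edges for i, c in enumerate(args) if c == cluster}

def toy_gain(edges, a, b):
    """Stand-in for the gain in posterior: Jaccard overlap of slots minus 0.5."""
    sa, sb = signature(edges, a), signature(edges, b)
    return len(sa & sb) / len(sa | sb) - 0.5

def lift_graph(atoms, gain=toy_gain):
    """Start from singleton clusters and merge the best pair until no gain is positive."""
    edges = Counter((p, tuple(frozenset([c]) for c in args)) for p, args in atoms)
    while True:
        clusters = sorted({c for _, args in edges for c in args}, key=sorted)
        pairs = list(combinations(clusters, 2))
        if not pairs:
            return edges
        best = max(pairs, key=lambda pair: gain(edges, *pair))
        if gain(edges, *best) <= 0:
            return edges
        edges = merge(edges, *best)

# Hypothetical true ground atoms.
atoms = [
    ("Teaches", ("Pete", "CS1")), ("Teaches", ("Paul", "CS2")),
    ("Advises", ("Pete", "Sam")), ("Advises", ("Paul", "Sara")),
    ("TAs", ("Sam", "CS1")), ("TAs", ("Sara", "CS2")),
]
for edge, count in lift_graph(atoms).items():
    print(edge, count)
```

On this toy input the loop merges the two courses, the two professors, and the two students, then stops because no remaining pair has a positive gain.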

5.1.2  Path  Finding 

LHL  constructs  paths  by  starting  from  each  hyperedge  in  a  hypergraph.  It  begins  by  adding  a 
hyperedge  to  an  empty  path,  and  then  recursively  adds  hyperedges  linked  to  nodes  already 
present  in  the  path  (hyperedges  already  in  the  path  are  not  re-added).  Its  search  terminates  when 
the  path  reaches  a  maximum  length  or  when  no  new  hyperedge  can  be  added.  Each  time  a 
hyperedge  is  added  to  the  path,  FindPath  stores  the  resulting  path  as  a  new  one.  All  the  paths  are 
passed  on  to  the  next  step  to  create  clauses. 
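
A minimal Python sketch of this search is given below. The function name `find_paths`, the hyperedge representation, and the example hyperedges are assumptions for illustration; paths are stored as sets, so the same set of hyperedges reached in a different order is recorded only once.

```python
def find_paths(hyperedges, max_length):
    """Enumerate connected sets of hyperedges with at most max_length members."""
    edges = list(hyperedges)
    found = set()

    def extend(path, nodes):
        if path in found:          # already explored via another ordering
            return
        found.add(path)            # every intermediate path is stored
        if len(path) == max_length:
            return
        for edge in edges:
            # Only add hyperedges not yet in the path that touch a node already present.
            if edge not in path and nodes & set(edge[1]):
                extend(path | {edge}, nodes | set(edge[1]))

    for edge in edges:
        extend(frozenset([edge]), set(edge[1]))
    return found

# Hypothetical lifted hyperedges: (predicate label, ordered cluster nodes).
lifted = [
    ("Teaches", ("Professor", "Course")),
    ("Advises", ("Professor", "Student")),
    ("TAs", ("Student", "Course")),
]
for path in sorted(find_paths(lifted, 3), key=len):
    print(sorted(path))
```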

5.1.3  Clause  Creation  and  Pruning 

A path in the hypergraph corresponds to a conjunction of $r(y_1, \ldots, y_n)$ hyperedges, and it
guarantees that the conjunction has at least one support in the hypergraph. We replace each $y_i$ in
a path with a variable, thereby creating a variabilized atom for each hyperedge. We convert the
conjunction of positive literals to a clause because that is the form typically used by ILP
and by MLN structure learning and inference algorithms. In Markov logic, a conjunction of
positive literals with weight $w$ is equivalent to a clause of negative literals with weight $-w$. In
addition,  we  add  clauses  with  the  signs  of  up  to  n  literals  flipped  (where  n  is  a  user-defined 
parameter),  since  the  resulting  clauses  may  also  be  useful.  We  evaluate  each  clause  using 
weighted  pseudo-log-likelihood  (WPLL)  [23]. 
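
The clause creation step can be sketched as follows. The Python code is illustrative only: the function names, the variable naming scheme (v0, v1, ...), and the representation of a literal as (is_positive, predicate, variables) are assumptions, and WPLL scoring is not shown.

```python
from itertools import combinations

def variabilize(path):
    """Replace each distinct node in the path by a variable name (v0, v1, ...)."""
    names = {}
    literals = []
    for pred, args in path:
        variables = tuple(names.setdefault(a, f"v{len(names)}") for a in args)
        literals.append((pred, variables))
    return literals

def clauses_from_path(path, max_flips):
    """Return the all-negative clause plus variants with up to max_flips signs flipped."""
    literals = variabilize(path)
    clauses = []
    for k in range(max_flips + 1):
        for flipped in combinations(range(len(literals)), k):
            # A literal is positive only if its sign was flipped.
            clauses.append(tuple(
                (i in flipped, pred, variables)
                for i, (pred, variables) in enumerate(literals)
            ))
    return clauses

# Hypothetical path in a lifted hypergraph.
path = [
    ("Teaches", ("Professor", "Course")),
    ("Advises", ("Professor", "Student")),
    ("TAs", ("Student", "Course")),
]
for clause in clauses_from_path(path, max_flips=1):
    print(" v ".join(("" if pos else "!") + f"{p}({','.join(vs)})" for pos, p, vs in clause))
```

For instance, flipping the sign of the last literal yields !Teaches(v0,v1) v !Advises(v0,v2) v TAs(v2,v1), which reads as: if a professor teaches a course and advises a student, then the student is a TA for that course.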

We iterate over the clauses from shortest to longest. For each clause, we compare its score
against  those  of  its  sub-clauses  (considered  separately)  that  have  already  been  retained.  If  the 
clause  scores  higher  than  all  of  these  sub-clauses,  it  is  retained;  otherwise,  it  is  discarded.  In  this 
manner,  we  discard  clauses  which  are  unlikely  to  be  useful.  Note  that  this  process  is  efficient 
because  the  score  of  a  clause  only  needs  to  be  computed  once,  and  can  be  cached  for  future 
comparisons.  (Alternatively,  we  could  evaluate  a  clause  against  all  its  sub-clauses  taken  together, 
but  this  would  require  re-optimizing  the  weights  for  each  combination  of  sub-clauses  for  every 
comparison,  which  is  computationally  expensive.) 
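
The pruning rule can be sketched in a few lines of Python. The scores below are made-up numbers standing in for WPLL values, and the representation of a clause as a frozenset of literal strings is an assumption of this sketch.

```python
def prune(scored_clauses):
    """Keep a clause only if it scores higher than every retained proper sub-clause.

    scored_clauses maps a clause (frozenset of literals) to its cached score.
    """
    retained = {}
    for clause in sorted(scored_clauses, key=len):  # shortest to longest
        score = scored_clauses[clause]
        # `sub < clause` tests for a proper subset, i.e., a sub-clause.
        if all(score > s for sub, s in retained.items() if sub < clause):
            retained[clause] = score
    return retained

# Hypothetical clauses and scores.
a = frozenset({"!Advises(x,y)"})
b = frozenset({"!Advises(x,y)", "TAs(y,z)"})
c = frozenset({"!Advises(x,y)", "!Teaches(x,z)", "TAs(y,z)"})
scores = {a: -0.50, b: -0.60, c: -0.30}

print(sorted(len(clause) for clause in prune(scores)))  # lengths of retained clauses
```

Here the two-literal clause is discarded because it scores lower than its retained sub-clause, while the three-literal clause is kept; it is compared only against sub-clauses that were themselves retained.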

Finally we add the retained clauses to an MLN. We have the option of doing this in several
ways. We could greedily add the rules one at a time in order of decreasing score. After adding
each rule, we relearn the weights, and keep the rule in the MLN if it improves the overall WPLL.
Alternatively, we could add all the rules to the MLN, and learn weights using L1-regularization
to prune away ‘bad’ rules by giving them zero weights [26]. Lastly, we could use
L2-regularization instead if the number of rules is not too large, and rely on the regularization to
give ‘bad’ rules low weight. Optionally, we discard rules containing ‘dangling’ variables (i.e.,
variables which only appear once in a clause), since these are unlikely to be useful.
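
The greedy option and the dangling-variable filter can be sketched in Python as follows. Weight learning is hidden behind a hypothetical callable `wpll_of`, which is assumed to relearn the weights and return the WPLL of a candidate rule set; the toy scoring function and the rule names are made up for illustration.

```python
from collections import Counter

def has_dangling_variable(clause):
    """True if some variable appears exactly once in the clause.

    A clause is a sequence of (is_positive, predicate, variables) literals.
    """
    counts = Counter(v for _, _, variables in clause for v in variables)
    return any(n == 1 for n in counts.values())

def greedy_select(scored_rules, wpll_of):
    """Add rules in order of decreasing score, keeping each one that improves WPLL."""
    mln, best = [], wpll_of([])
    for rule, _ in sorted(scored_rules, key=lambda pair: pair[1], reverse=True):
        trial = wpll_of(mln + [rule])  # assumed to relearn weights internally
        if trial > best:
            mln.append(rule)
            best = trial
    return mln

# Hypothetical stand-in for weight learning plus WPLL evaluation.
contribution = {"r1": 0.3, "r2": -0.1, "r3": 0.2}

def toy_wpll(rules):
    return -1.0 + sum(contribution[r] for r in rules)

print(greedy_select([("r1", -0.2), ("r2", -0.4), ("r3", -0.3)], toy_wpll))

closed = ((False, "Teaches", ("v0", "v1")), (False, "Advises", ("v0", "v2")),
          (True, "TAs", ("v2", "v1")))
print(has_dangling_variable(closed), has_dangling_variable(closed[:2]))
```
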

5.2 Experiments


We  carried  out  experiments  to  investigate  the  performance  of  LHL  on  three  datasets,  publicly 
available  at  http://alchemy.cs.washington.edu.  The  IMDB  dataset  was  created  by  [24]  from  the 
IMDB.com  database.  It  describes  a  movie  domain,  and  contains  predicates  describing  movies, 
actors, directors, and their relationships (e.g., WorkedIn(person, movie)). The UW-CSE
dataset, prepared by [7], describes an academic department. Its predicates describe students,
faculty, and their relationships (e.g., AdvisedBy(person1, person2)). The Cora dataset,
originally created by Andrew McCallum, is a collection of citations to computer science papers.
Predicates include SameCitation(c1, c2), TitleHasWord(title, word), etc. The IMDB, UW-CSE,
and Cora datasets respectively have 17,793, 260,254, and 687,422 ground atoms, of which
1,224, 2,112, and 42,558 are true. Each dataset is divided into 5 folds. Note that the primary task
in  the  Cora  domain  is  the  matching  of  entities,  i.e.,  the  citations,  and  their  author,  title  and  venue 
fields. 

We  compared  LHL  to  two  state-of-the-art  systems:  BUSL  [24]  and  MSL  [23].  Both  systems 
are implemented in the Alchemy software package [17]. BUSL uses a form of relational
pathfinding to find paths of ground atoms in the training data, but restricts itself to very short
paths (length 2) to avoid fully searching the large space of paths. It then greedily pieces the paths
together into longer ones. MSL uses beam search to search for clauses. It begins from an empty
clause, and systematically generates literals that can be used to extend the clause, evaluating each
clause thus created for its empirical adequacy. The best clause it finds is added to an MLN, and
the process is repeated until no new clauses can be found that improve the MLN’s fit to data.

We  evaluated  the  performance  of  the  systems  according  to  how  well  they  predict  the 
groundings  of  each  predicate  given  groundings  of  all  other  predicates  as  evidence.  For  each 
dataset,  we  performed  cross-validation  using  the  five  previously  defined  folds.  To  evaluate  the 
performance  of  the  systems,  we  measured  the  average  conditional  log-likelihood  of  the  test 
atoms  (CLL),  and  the  area  under  the  precision-recall  curve  (AUC).  Table  7  shows  the  results  of 
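the systems. From the table, we see that LHL achieves the best AUC on all three datasets and the
best CLL on two of them (IMDB and UW-CSE), but a worse CLL on Cora. The runtimes also suggest
that LHL scales better than the other systems: it is the slowest on IMDB, the smallest dataset,
but the fastest on Cora, the largest.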
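
For concreteness, the CLL measure can be computed as in the Python sketch below. The function name, the clamping constant, and the example probabilities are assumptions; the sketch takes the predicted probability that each test atom is true and averages the log-probability assigned to its actual truth value.

```python
import math

def average_cll(predictions, eps=1e-9):
    """Average conditional log-likelihood over test ground atoms.

    predictions is a list of (probability_true, truth) pairs, one per test atom.
    """
    total = 0.0
    for p_true, truth in predictions:
        p = p_true if truth else 1.0 - p_true
        p = min(max(p, eps), 1.0)  # clamp to avoid log(0)
        total += math.log(p)
    return total / len(predictions)

# Hypothetical predictions for three test atoms.
predictions = [(0.9, True), (0.2, False), (0.6, True)]
print(round(average_cll(predictions), 4))
```

Since each term is the logarithm of a probability, the CLL is never positive, and values closer to zero are better.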


Table 7. Experimental Comparison of MLN Structure Learners

          IMDB                      UW-CSE                    Cora
Systems   AUC   CLL    Time (min)   AUC   CLL    Time (hr)    AUC   CLL    Time (hr)
LHL       0.73  -0.13  15.3         0.22  -0.04  7.3          0.72  -0.64  13.6
BUSL      0.47  -0.14  4.7          0.21  -0.05  12.9         0.17  -0.37  18.7
MSL       0.41  -0.18  0.2          0.18  -0.57  2.1          0.17  -0.37  65.6

6.0 CONCLUSION AND RECOMMENDATION


We  successfully  developed  the  approach  we  planned.  Our  SNE  system  for  extracting  semantic 
networks  from  text  is,  to  our  knowledge,  the  most  advanced  to  date  in  the  scale  and  accuracy  of 
the entity, schema and concept matching it can perform. Our unsupervised, object-level approach
is  currently  the  state  of  the  art  for  coreference  matching  on  standard  datasets,  outperforming  even 
previous  supervised  approaches.  Generally  speaking,  unsupervised  object-level  matching  is  the 
superior  approach,  and  the  one  we  would  recommend  a  priori.  Our  LHL  system  for  learning 
background  knowledge  from  data  when  the  knowledge  is  not  available  a  priori  also  outperforms 
two  state-of-the-art  systems. 

We  originally  planned  to  experiment  on  a  variety  of  unstructured,  semi-structured  and 
structured  data  (e.g.,  free  text,  Web  pages  and  databases,  respectively).  However,  our 
experiments  focused  mainly  on  unstructured  data,  due  to  the  difficulty  in  obtaining  good  semi- 
structured  and  structured  datasets.  Although  the  latter  are  of  course  common  in  the  real  world, 
there  are  currently  no  standard  testbeds  available  that  simultaneously  include  entity,  schema  and 
concept matching problems. This is not surprising, since research had previously not progressed
this far, but a supposedly available dataset we were planning to use turned out to be unavailable.

Our  work  is  important  to  data  integration  in  general  and  DARPA  in  particular  because  it  is 
the  first  to  jointly  handle  the  problems  of  entity,  schema  and  concept  matching.  Since  all  three 
are  usually  present  in  real-world  domains,  truly  effective  data  integration  cannot  be 
accomplished without it.

8.0 LIST OF ACRONYMS

ACE      Automatic Content Extraction
BUSL     Bottom-Up Structure Learner
DIESEL   Data Integration and Exploitation System that Learns
i.i.d.   independent and identically distributed
ILP      inductive logic programming
IRM      Infinite Relational Model
ITC      Information-Theoretic Co-clustering
LHL      Learning via Hypergraph Lifting
MAP      maximum a posteriori
MLN      Markov logic network
MRC      Multiple Relational Clustering
MSL      Markov logic Structure Learner
MUC      Message Understanding Conference
NLP      natural language processing
PSCG     preconditioned scaled conjugate gradient
WPLL     weighted pseudo-log-likelihood